[RL] pause: use abort pipeline with scheduling loop alive for req drained#7753

Open
jackyYang6 wants to merge 1 commit into PaddlePaddle:develop from jackyYang6:rl/pause-abort

jackyYang6 (Contributor) commented May 8, 2026

Depends-on: #7615 (refact abort_requests to fire-and-forget)

Motivation

In RL scenarios, the upstream framework calls abort_request followed by pause to stop the engine. The old _control_pause implementation had two critical issues:

  1. Lost partial results: preempted_all() + _send_error_response(500) discarded already-generated tokens, so clients received a 500 error instead of partial results.

  2. Deadlock with abort pipeline: Setting is_paused=True at the start blocked the scheduling loop (_pause_cond.wait_for), which prevented _trigger_abort from processing abort requests — causing a 30s timeout deadlock.

The new design separates "reject new requests" (_rejecting_new_requests) from "pause scheduling loop" (is_paused), allowing the abort pipeline to complete naturally before engine state reset. This ensures partial inference results are returned to clients via token_processor._put_abort_results (200 "Aborted") through the normal output path.

Modifications

fastdeploy/engine/common_engine.py

| Change | Description |
| --- | --- |
| `self._rejecting_new_requests = False` | New flag in `__init__` to decouple request rejection from scheduling loop pause |
| `if self.is_paused or self._rejecting_new_requests:` | Request intake check now covers both states |
| `_control_pause()` rewrite | Two-phase design: (1) reject + abort + drain, (2) pause + reset |
| `_wait_inflight_drained()` | New method: polls until `resource_manager.requests` is empty |

Execution flow

_control_pause:
  ├─ _rejecting_new_requests = True      (block new requests, scheduling loop alive)
  ├─ add_abort_req_ids(ALL)              (scheduling loop processes via _trigger_abort)
  ├─ _wait_inflight_drained()            (poll rm.requests empty)
  ├─ is_paused = True                    (now pause scheduling loop)
  └─ cache reset
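
The flow above can be sketched with a toy model. This is a minimal sketch, not the FastDeploy implementation: `ToyEngine` and all of its attributes are hypothetical stand-ins, the "decode step" just appends a placeholder token, and the drain timeout is an addition so the example cannot hang (the PR itself polls without a timeout).

```python
import threading
import time


class ToyEngine:
    """Toy model of the two-phase pause. All names are illustrative."""

    def __init__(self):
        self.lock = threading.Lock()
        self.is_paused = False                # stops the scheduling loop
        self.rejecting_new_requests = False   # only blocks request intake
        self.inflight = {}                    # req_id -> tokens generated so far
        self.abort_ids = set()                # ids queued for abort
        self.results = []                     # (req_id, status, partial tokens)
        self.stopped = threading.Event()

    def add_request(self, req_id):
        with self.lock:
            if self.is_paused or self.rejecting_new_requests:
                return False
            self.inflight[req_id] = []
            return True

    def scheduling_loop(self):
        while not self.stopped.is_set():
            with self.lock:
                if not self.is_paused:
                    # Abort step: return whatever was generated so far.
                    for req_id in self.abort_ids:
                        tokens = self.inflight.pop(req_id, None)
                        if tokens is not None:
                            self.results.append((req_id, "Aborted", tokens))
                    self.abort_ids.clear()
                    # Decode step: one more token per live request.
                    for tokens in self.inflight.values():
                        tokens.append("tok")
            time.sleep(0.001)

    def pause(self, drain_timeout=5.0):
        # Phase 1: reject new work and queue every inflight id for abort.
        with self.lock:
            self.rejecting_new_requests = True
            self.abort_ids.update(self.inflight)
        # Drain: the loop is still alive, so it can process the aborts.
        deadline = time.monotonic() + drain_timeout
        while True:
            with self.lock:
                if not self.inflight:
                    # Phase 2: only now stop the loop (cache reset would follow).
                    self.is_paused = True
                    return
            if time.monotonic() > deadline:
                raise TimeoutError("inflight requests did not drain")
            time.sleep(0.001)


engine = ToyEngine()
loop = threading.Thread(target=engine.scheduling_loop, daemon=True)
loop.start()
engine.add_request("a")
engine.add_request("b")
time.sleep(0.02)                      # let a few decode steps run
engine.pause()
accepted_after_pause = engine.add_request("c")
engine.stopped.set()
loop.join()
```

If `pause()` set `is_paused` first, the loop in this toy would skip its abort step and the drain could never finish, which mirrors the deadlock described in the Motivation.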

Usage or Command

# Pause (aborts all inflight requests with partial results, then resets engine)
curl -X POST http://localhost:8180/v1/pause

# Check paused state
curl http://localhost:8180/v1/is_paused

# Resume
curl -X POST http://localhost:8180/v1/resume
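
For callers that drive these endpoints from Python rather than curl, a minimal client sketch using only the standard library might look as follows. The host, port, and paths are taken from the commands above; the helper names are hypothetical, and the response body format is not documented in this PR, so `send` returns the raw bytes without interpreting them.

```python
import urllib.request

BASE_URL = "http://localhost:8180"  # host and port from the curl examples

# HTTP method per control action, matching the curl commands
# (curl without -X issues a GET).
_METHODS = {"pause": "POST", "resume": "POST", "is_paused": "GET"}


def build_control_request(action, base_url=BASE_URL):
    """Build (but do not send) the HTTP request for a control action."""
    if action not in _METHODS:
        raise ValueError(f"unknown control action: {action!r}")
    return urllib.request.Request(f"{base_url}/v1/{action}", method=_METHODS[action])


def send(request, timeout=10.0):
    """Send a prepared request and return (status code, raw body bytes)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status, resp.read()


pause_request = build_control_request("pause")
status_request = build_control_request("is_paused")
```

Calling `send(pause_request)` requires a running server; because pause waits for inflight requests to drain, a generous `timeout` may be appropriate.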

Accuracy Tests

Checklist

  • Add at least a tag in the PR title: [RL], [Engine]
  • Format your code, run pre-commit before commit.
  • Add unit tests. (unit test updated for test_control_pause_and_resume_paths)
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch.

paddle-bot Bot commented May 8, 2026

Thanks for your contribution!

PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot commented May 8, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-13 16:54:42

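This CI report was generated from the following code (refreshed every 30 minutes):


1 Task overview

1 Required task failed (Approval requires sign-off from a designated RD), and 2 more Required tasks are still running; watch for their results.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 40 (0) | 40 | 35 | 2 | 2 | 1 | 0 |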

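2 Task status summary

2.1 Required tasks: 7/10 passed

Required tasks block merging; failures should be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
| --- | --- | --- | --- | --- | --- | --- |
| ❌ | Approval | 11s | PR issue: new `llm_logger.info` calls require approval from a designated RD | Ask xyxinyang or zyyzghb to review and approve | Job | - |
| ⏳ | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | - | Running | - | Job | - |
| ⏳ | Extracted partial CE model tasks to run in CI. / run_ce_cases | - | Running | - | Job | - |
| ✅ | The remaining 7 required tasks passed | - | - | - | - | - |

2.2 Optional tasks: 28/30 passed

Optional tasks do not block merging; failures are informational only.

| Status | Task | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
| ❌ | Run iluvatar Tests / run_iluvatar_cases | 10m34s | Job | - |
| ⏸️ | CI_HPU | - | - | - |
| ✅ | The remaining 28 optional tasks passed | - | - | - |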

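3 Failure details (required only)

Approval: coding standards (confidence: high)

Approval

  • Status: ❌ Failed
  • Error type: coding standards
  • Confidence: high
  • Root cause summary: PR issue: new llm_logger.info calls require approval from a designated RD
  • Analyzer: generic analysis (fallback)

Root cause details:
The PR adds two self.llm_logger.info(...) calls in the pause logic, which triggers FastDeploy's approval policy for logging behavior. scripts/check_approval.sh detected a logging change in the diff (.info/.debug/.error/log_request) and requires approval from at least one of xyxinyang (zhouchong) or zyyzghb (zhangyongyue) before the check can pass.

Key log: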

Detected log modification in diff:
+        self.llm_logger.info(f"Pause: aborting {len(all_req_ids)} total requests.")
+        self.llm_logger.info(f"All inflight requests drained, take time: ...")
0. You must have one FastDeploy RD (xyxinyang(zhouchong), zyyzghb(zhangyongyue)) approval for modifying logging behavior (.info/.debug/.error/log_request).
There are 1 approved errors.
##[error]Process completed with exit code 6.

Suggested fix:

  1. Have xyxinyang (zhouchong) or zyyzghb (zhangyongyue) review and approve this PR

Suggested fix summary: ask xyxinyang or zyyzghb to review and approve

Related change: the PR adds two self.llm_logger.info(...) calls in the pause logic
Link: view log

codecov-commenter commented May 8, 2026

Codecov Report

❌ Patch coverage is 57.14286% with 6 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@8396ef6). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| fastdeploy/engine/common_engine.py | 57.14% | 5 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7753   +/-   ##
==========================================
  Coverage           ?   63.17%           
==========================================
  Files              ?      461           
  Lines              ?    64121           
  Branches           ?     9821           
==========================================
  Hits               ?    40506           
  Misses             ?    20840           
  Partials           ?     2775           

| Flag | Coverage Δ |
| --- | --- |
| GPU | 72.29% <57.14%> (?) |
| XPU | 7.13% <7.14%> (?) |


PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-13 15:51:59

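📋 Review summary

PR overview: Rewrites _control_pause() as a two-phase pause: first abort all requests gracefully through the abort pipeline (returning partial results), then pause the scheduling loop. This fixes the abort + pause deadlock and the loss of partial results in RL scenarios.
Scope of change: engine/common_engine.py, engine/sched/resource_manager_v1.py, entrypoints/router/, and related tests
Impact tags: [Engine] [APIServer] [Scheduler]

📝 PR convention check

The description structure is compliant (all 5 required sections are present and non-empty).

The title tag [RL] is a valid official tag, but according to the impact table in architecture.md, [RL] maps to fastdeploy/rl/, whereas this PR's changes are concentrated in fastdeploy/engine/ and fastdeploy/entrypoints/. Consider changing it to [Engine] (the author's own PR checklist also lists both [RL] and [Engine]; Engine describes the scope of the change more accurately).

Suggested title (copy-ready):

  • [Engine] pause: use abort pipeline with scheduling loop alive for graceful pause in RL scenarios

Issues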

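| Level | File | Summary |
| --- | --- | --- |
| 🔴 Bug | fastdeploy/engine/common_engine.py:1326 | When `req_id` is `None`, a warning is logged but there is no `continue`, so `None` is written into `waiting_abort_req_id_set` |
| 🟡 Suggestion | fastdeploy/engine/common_engine.py:1504 | `_wait_inflight_drained()` has no timeout; if a worker fails, the control thread hangs forever and cannot recover |
| ❓ Question | fastdeploy/engine/common_engine.py | The execution flow in the PR description lists handle scheduler stragglers, `_wait_output_queue_empty()`, and `scheduler.reset()`, none of which is implemented in the code; the description and the implementation disagree |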

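Overall assessment

The two-phase pause design is clear and fixes the deadlock and the partial-result loss at the root, and the abort-pipeline reuse path has been functionally verified. There is one P0 bug (a null req_id without continue lets an invalid abort into the pipeline), and the missing timeout is an operational risk that deserves attention.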

Comment thread fastdeploy/engine/common_engine.py Outdated
"Receive abort request without request_id, skip invalid abort message"
)
self.llm_logger.info(f"Receive abort request, req_id: {req_id}")
self.resource_manager.add_abort_req_ids(req_id)
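
The guard the review asks for can be sketched as below. This is a hedged illustration, not the engine's code: the function name, the dict-shaped abort message, and the plain `set` standing in for waiting_abort_req_id_set are all assumptions; only the log messages are taken from the snippet above.

```python
import logging

logger = logging.getLogger("abort_sketch")


def collect_abort_ids(abort_messages, waiting_abort_ids):
    """Add valid request ids from abort messages to the waiting set."""
    for message in abort_messages:
        req_id = message.get("request_id")
        if req_id is None:
            logger.warning(
                "Receive abort request without request_id, skip invalid abort message"
            )
            continue  # without this, None would fall through into the set
        logger.info("Receive abort request, req_id: %s", req_id)
        waiting_abort_ids.add(req_id)
    return waiting_abort_ids


collected = collect_abort_ids(
    [{"request_id": "r1"}, {}, {"request_id": None}, {"request_id": "r2"}],
    set(),
)
```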

This comment was marked as outdated.

Comment thread fastdeploy/engine/common_engine.py
…l termination

Replace the old preempted_all + error_response approach in _control_pause
with a two-phase design:

Phase 1: Block new requests via _rejecting_new_requests (NOT is_paused)
  - Scheduling loop keeps running so _trigger_abort can process
  - add_abort_req_ids(ALL) marks all requests for abort
  - Scheduling loop catches them via _trigger_abort as they cycle through

Phase 2: After drain, set is_paused=True to fully stop scheduling loop
  - Handle scheduler-only stragglers with direct _send_error_response
  - Wait for output queue empty, then reset

Depends-on: PaddlePaddle#7615 (refact abort_requests to fire-and-forget)

PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-13 16:20:55

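📋 Review summary

PR overview: Rewrites the _control_pause() implementation to fix the abort pipeline deadlock and the discarding of partial results in RL scenarios.
Scope of change: fastdeploy/engine/common_engine.py, tests/engine/test_common_engine.py
Impact tags: [Engine] [RL]

📝 PR convention check

The PR description structure is complete (Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist are all present). The pre-commit item in the Checklist is unchecked; please run it before merging.

Issues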

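| Level | File | Summary |
| --- | --- | --- |
| 🟡 Suggestion | fastdeploy/engine/common_engine.py:1491 | `_wait_inflight_drained()` has no timeout and blocks forever if the abort pipeline stalls |
| ❓ Question | fastdeploy/engine/common_engine.py:1463 | The PR execution flow describes `scheduler.reset()`, but it is missing from the code; `scheduler.responses` may hold leftovers |
| ❓ Question | fastdeploy/engine/common_engine.py | The PR Modifications table declares a new method `_wait_output_queue_empty()`, but the diff contains no implementation of it |
| ❓ Question | fastdeploy/engine/common_engine.py | `token_processor.clear_data()` was removed, with no comment explaining whether the abort pipeline covers its cleanup duties |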

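Overall assessment

The overall design is clear, and separating the two phases (reject and pause) effectively resolves the deadlock; the accuracy tests passed, confirming functional correctness. However, the PR description and the actual implementation disagree in several places (scheduler.reset() and _wait_output_queue_empty() are described but absent from the code), so the author should either explain the omissions or add the missing implementation. Consider adding a fallback timeout to _wait_inflight_drained().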

No timeout — abort pipeline will complete. Aligned with SGLang's poll-until-drained.
"""
start_time = time.time()
while (

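🟡 Suggestion: _wait_inflight_drained() has no timeout and may block forever

The original code waited on the worker queue with a 60s timeout and raised an Exception; the new design removes timeout protection entirely. The docstring says "No timeout" on the grounds that the abort pipeline will complete, but if the abort pipeline stalls because of a bug or a fault (for example a worker hang or a lost ZMQ message), _control_pause() will block forever and the upstream RL framework caller has no way to notice, resulting in a silent hang.

Suggested fallback timeout: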

DRAIN_TIMEOUT = 120
start_time = time.time()
while (self.resource_manager.requests or self.scheduler.requests
       or self.resource_manager.waiting_abort_req_id_set
       or self.resource_manager.to_be_aborted_req_id_set):
    if time.time() - start_time > DRAIN_TIMEOUT:
        self.llm_logger.error(f"Drain timed out after {DRAIN_TIMEOUT}s, abort pipeline may have stalled!")
        raise TimeoutError(f"_wait_inflight_drained timed out after {DRAIN_TIMEOUT}s")
    time.sleep(0.005)

self._send_error_response(req.request_id, "Request is aborted since engine is paused.")
self.scheduler.reset()

if envs.ENABLE_V1_KVCACHE_MANAGER:

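❓ Question: the PR execution flow describes scheduler.reset(), but it is missing from the code here

The end of the execution flow in the PR description explicitly lists scheduler.reset() + cache reset, but the code only performs the cache reset; self.scheduler.reset() is never called.

Looking at the implementation of local_scheduler.reset() (line 115-119), it clears:

  • ids_read_cursor (reset to 0)
  • ids (the list of all historical request IDs)
  • requests (the dict of pending requests)
  • responses (the dict of received responses)

_wait_inflight_drained() only checks that requests and the abort queues are empty; it does not check responses. If scheduler.responses still holds unconsumed data, state may be inconsistent after resume.

Please confirm: is omitting scheduler.reset() intentional (the normal abort pipeline path already guarantees that responses are fully consumed), or was the implementation left out by mistake?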
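
The concern can be illustrated with a toy model. This is only a sketch under assumptions: `ToyScheduler` is a hypothetical stand-in that borrows the four field names listed above, and `requests_drained` mimics a drain check that looks at requests alone; neither is the actual local_scheduler code.

```python
from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    """Hypothetical scheduler state with the four fields named in the review."""

    ids_read_cursor: int = 0
    ids: list = field(default_factory=list)
    requests: dict = field(default_factory=dict)
    responses: dict = field(default_factory=dict)

    def reset(self):
        # Clear every piece of per-request state, as the review describes.
        self.ids_read_cursor = 0
        self.ids = []
        self.requests = {}
        self.responses = {}


def requests_drained(scheduler):
    """A drain check that, like the one questioned above, ignores responses."""
    return not scheduler.requests


# All requests are gone, but one response chunk was never consumed.
scheduler = ToyScheduler(ids=["r1"], ids_read_cursor=1,
                         responses={"r1": ["late chunk"]})
drained_before_reset = requests_drained(scheduler)
leftover_before_reset = dict(scheduler.responses)
scheduler.reset()
```

In this toy the drain check passes while a response is still queued, and only `reset()` removes it. Whether the real abort pipeline can leave such leftovers is exactly what the question asks the author to confirm.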

@jackyYang6 jackyYang6 changed the title [RL] pause: use abort pipeline with scheduling loop alive for gracefu… [RL] pause: use abort pipeline with scheduling loop alive for req drained May 13, 2026

3 participants